To assess nonexperimental (NX) evaluation methods in
the context of welfare, job training, and employment
services programs, the authors reexamined the results of
twelve case studies intended to replicate impact estimates
from an experimental evaluation by using NX
methods. They found that the NX methods sometimes 
came close to replicating experimentally derived results 
but often produced estimates that differed by policy- 
relevant margins, which the authors interpret as esti- 
mates of bias. Although the authors identified several 
study design factors associated with smaller discrepan- 
cies, no combination of factors would consistently elimi- 
nate discrepancies. Even with a large number of impact 
estimates, the positive and negative bias estimates did 
not always cancel each other out. Thus, it was difficult to 
identify an aggregation strategy that consistently 
removed bias while answering a focused question about 
earnings impacts of a program. They conclude that 
although the empirical evidence from this literature can 
be used in the context of training and welfare programs 
to improve NX research designs, it cannot on its own justify
the use of such designs.

Assessing Alternatives to
Social Experiments 

Controlled experiments, where subjects are 
randomly assigned to receive interventions, are 
desirable but often thought to be infeasible or 
overly burdensome, especially in social settings. 
Therefore, researchers often substitute non- 
experimental (NX) or "quasi-experimental" 
methods, in which researchers use treatment 
and comparison groups but do not randomly 
assign subjects to the groups. NX methods are
less intrusive and sometimes less costly than controlled experiments, but their
validity rests on untestable assumptions about the differences between treatment 
and comparison groups. 

Recently, a growing number of case studies have tried to use randomized experiments
to validate NX methods. To date, this growing literature has not been integrated
in a systematic review or meta-analysis. The most comprehensive summary
(Bloom et al. 2002) addressed the portion of this literature dealing with mandatory 
welfare programs. However, efforts to put the quantitative bias estimates from 
these studies in a common metric and combine them to draw general lessons have 
been lacking. 

This article reports on a systematic review of such replication studies to assess 
the ability of NX designs to produce valid impacts of social programs on participants'
earnings.

Specifically, this article addresses the following questions: 

• Can NX methods approximate the results from a well-designed and well-executed experiment? 

• Which NX methods are more likely to replicate impact estimates from a well-designed and 
well-executed experiment, and under what conditions are they likely to perform better? 

• Can averaging multiple NX impact estimates approximate the results from a well-designed 
and well-executed experiment? 

The answers to these questions will help consumers of evaluation research, includ- 
ing those who conduct literature reviews and meta-analyses, decide whether and 
how to consider NX evidence. They will also help research designers decide, when
random assignment is not feasible, whether there are conditions that justify a NX
research design. 

Between- and within-study comparisons 

Researchers use two types of empirical evidence to assess NX methods: 
between-study comparisons and within-study comparisons (Shadish 2000). This 
article synthesizes evidence from within-study comparisons, but we describe 
between-study evidence as background. 

Between-study comparisons. Between-study comparisons look at multiple stud- 
ies that use different research designs and study samples to estimate the impact of 
the same type of program. By comparing results from experimental studies with 
those of NX ones, researchers try to derive the relationship between the design and 
the estimates of impact. Examples include Reynolds and Temple (1995), who com- 
pared three studies; and Cooper et al. (2000, Table 2), the National Reading Panel 
(2000, chapter I, Tables 6-7), and Shadish and Ragsdale (1996), who all compared 
dozens or hundreds of studies by including research design variables as moderators 
in their meta-analyses. These analyses produced mixed evidence on whether 
quasi-experiments produced higher or lower impact estimates than experiments. 

An even more comprehensive between-study analysis by Lipsey and Wilson 
(1993) found mixed evidence as well. For many types of interventions, the average 
of the NX studies gives a slightly different answer from the average of the experi- 
mental studies, while for some, it gives a markedly different answer. The authors 
found seventy-four meta-analyses that distinguished between randomized and 
nonrandomized treatment assignment and showed that the average effect sizes for 
the two groups were similar, 0.46 of a standard deviation from the experimental 
designs and 0.41 from the NX designs. But such findings were based on averages 
over a wide range of content domains, spanning nearly the entire applied psychol- 
ogy literature. Graphing the distribution of differences between random and 
nonrandom treatment assignment within each meta-analysis (where each one per- 
tains to a single content domain), they showed that the average difference between 
findings based on experimental versus NX designs was close to zero, implying no 
bias. But the range extended from about -1.0 standard deviation to +1.6 standard 
deviations, with the bulk of differences falling between -0.20 and +0.40. Thus, the 
between-study evidence does not resolve whether differences in impact estimates 
are due to design or to some other factor. 

Within-study comparisons. In a within-study comparison, researchers estimate 
a program's impact by using a randomized control group and then reestimate the
impact by using one or more nonrandomized comparison groups. We refer to these
comparisons, described formally below, as "design replication" studies. The
nonrandomized comparison groups are formed and their outcomes adjusted by 
using statistical or econometric techniques aimed at estimating or eliminating 
selection bias. Design replication studies can use multiple comparison groups or
the same comparison group with multiple sample restrictions to examine the effect
of different comparison group strategies. The NX estimate is meant to mimic what 
would have been estimated if a randomized experiment had not been conducted. If 
the NX estimate is close to the experimental estimate, then the NX technique is 
assumed to be "successful" at replicating an unbiased research design. 

Within-study comparisons make it clear that the difference in findings between 
methods is attributable to the methods themselves rather than to investigator bias, 
differences in how the intervention was implemented, or differences in treatment
setting. For this reason, within-study comparisons can yield relatively clean esti- 
mates of selection bias. On the other hand, it is more difficult to rule out the effects 
of chance for a given set of within-study comparisons. Therefore, general conclu- 
sions require, as in this article, several within-study comparisons in a variety of 
contexts. 



Design replication to estimate bias 

The current review differs from standard meta-analysis because the "effect size" 
of interest is not the impact of some intervention on a given outcome but the dis- 
crepancy between experimental and NX impact estimates. This, we argue, is itself 
an estimate of the bias. Bias can never be directly observed, because the true
impact, \(\theta\), is not known. This review includes two equivalent types of studies that allow
us to estimate the bias empirically. The first type presents up to \(K\) NX estimators,
\(\hat{\theta}_k\), of the impact, where \(k = 1, \ldots, K\), and one experimental estimate, \(\hat{\theta}_0\), such that
\(E[\hat{\theta}_0] = \theta\). The second type compares the average outcome for a control group, \(\bar{Y}_0\),
with the (adjusted) average outcome, \(\bar{Y}_k\), for some comparison group based on NX
method \(k\). The relationship between these variables is shown in equations (1) and
(2), where \(\bar{Y}_T\) represents the average outcome for the treated group and \(B(\hat{\theta}_k)\) is the
bias.

\[ \hat{\theta}_0 = \bar{Y}_T - \bar{Y}_0 \tag{1} \]

\[ \hat{\theta}_k = \bar{Y}_T - \bar{Y}_k \tag{2} \]

Using these estimates, we can estimate the bias associated with each of the \(K\)
estimators, defined as \(B(\hat{\theta}_k) = E[\hat{\theta}_k - \theta]\). Since the true parameter is not observable,
we estimate the bias as the difference between the NX and experimental estimators.
Subtracting equation (1) from equation (2) yields two forms of the bias estimate,
\(\hat{B}(\hat{\theta}_k)\), corresponding to the two types of reporting formats discussed above:

\[ (\hat{\theta}_k - \hat{\theta}_0) = (\bar{Y}_0 - \bar{Y}_k) = \hat{B}(\hat{\theta}_k). \tag{3} \]

Thus, the two types of studies are equivalent, even though the latter type does not
use information from the treatment group.

If the experiment is well executed, then the estimated bias should itself be unbiased,
as shown in equation (4).

\[ E[\hat{B}(\hat{\theta}_k)] = E[\hat{\theta}_k] - E[\hat{\theta}_0] = E[\hat{\theta}_k - \theta] = B(\hat{\theta}_k). \tag{4} \]

The goal of the analysis in this review is to model \(B(\hat{\theta}_k)\) as a function of the characteristics
and context of the study, the estimator, and the intervention whose
impact is being estimated. We recognize an important practical limitation in esti- 
mating such models, which is that the observed bias estimates vary not only 
because of the performance of the NX method (in reducing selection bias) and 
other contextual variables already noted but because of random sampling error in 
both the experimental and NX estimators. This sampling variance makes it difficult 
to judge when bias estimates are large enough or spread out enough to be evidence 
that a NX method has failed. Therefore, we have refrained throughout the analysis 
from making general statements that go beyond the collection of case studies we 
reviewed. 
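
As an illustration of equations (1) through (3), the following Python sketch uses made-up earnings figures to show that the two reporting formats yield the same bias estimate. The group values and variable names are hypothetical and do not come from any study in this review.

```python
from statistics import mean

# Hypothetical annual earnings in dollars (illustrative only, not study data).
treated = [11_200, 9_800, 12_500, 10_400]      # randomized treatment group
control = [10_100, 9_300, 11_000, 9_600]       # randomized control group
comparison = [10_900, 9_900, 11_800, 10_200]   # nonrandomized comparison group, after adjustment

y_t, y_0, y_k = mean(treated), mean(control), mean(comparison)

theta_0 = y_t - y_0   # experimental impact estimate, equation (1)
theta_k = y_t - y_k   # NX impact estimate, equation (2)

# First reporting format: difference between the two impact estimates.
bias_from_impacts = theta_k - theta_0
# Second reporting format: control mean minus comparison mean.
# The treated-group mean cancels, so it is not needed here.
bias_from_groups = y_0 - y_k

print(theta_0, theta_k, bias_from_impacts, bias_from_groups)
```

With these invented numbers the NX estimator understates the impact, so the bias estimate is negative under either format. Because every mean in the sketch is itself a sample statistic, a single bias estimate of this kind also carries sampling error from both groups.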



Data and Method 

In recent years, the number of design replication studies has been growing to 
the point where it is now possible to begin synthesizing the results to look for patterns.
This article draws on such a synthesis of the design replication literature,
focusing on studies that used earnings as an outcome. The rest of this section
describes the methods we used to assemble the design replication studies, con- 
struct the data set, and conduct the analysis. 

Inclusion criteria and search strategy 

To be included in the review, a study had to meet the following criteria: 

• A randomized control group was used to evaluate a program, and a comparison group was 
available for computing at least one NX estimate of the same impact.
Because some studies estimated bias directly by comparing comparison and control
groups, the presence of a treatment group is not required.

• The experimental-NX comparison was based on estimates from the same experiment. 
This criterion excludes the between-study comparisons described in Section I. 

• The experimental and NX estimates pertained to the same intervention in the same sites.

This criterion excludes, for example, a study of education programs in Bolivia 
(Newman et al. 2001), which compared findings from an experimental design in 
one region with findings from a NX design in another. Such a study confounds 
regional differences with differences in study design. 

• The intervention's purpose was to raise participants' earnings. This criterion restricts our 
focus to programs that provide job training and employment services.

The search process produced dozens of candidate studies. We narrowed them 
down to thirty-three for closer examination, and determined that twelve, listed in
Table 1, met the search criteria. The twelve studies correspond to nine interven- 
tions; four of these studies addressed the same intervention, the National Sup- 
ported Work Demonstration (NSW). All of the interventions involved job training 
or employment services, such as job search assistance or vocational rehabilitation, 
and participation was mandatory in about half of them. In terms of location, three 
interventions were single-site programs (in San Diego, California; Riverside, Cali- 
fornia; and Bergen, Norway); one was a multisite program in a single state 
(Florida); and the remaining five were multistate in the United States. Half of the 
interventions were studied in the 1990s; only one (NSW) was studied before 1980. 
Seven of the studies appeared in peer-reviewed journals or in books, three are final
reports of government contractors, and two are working papers or unpublished 
manuscripts. 

The quality of the evidence in these studies — in particular, the quality of the 
experiment — is critical to our analysis. The use of design replication as a validation 
exercise assumes that the experimental estimators in the studies are themselves 
unbiased. Common threats to the validity of the experimental estimator include
differential attrition or nonresponse, randomization bias, spillover effects, substitution
bias, John Henry effects, and Hawthorne effects. Bias could also arise from
nonuniform collection of data from treatment and control groups and from assignments
that were not truly random. Noncompliance with treatment assignment,
even if monitored and documented, can threaten an experiment's ability to answer
interesting policy questions. 

To evaluate the validity of the experimental estimates, we assessed the nine 
experiments in our review and found them to be of generally high quality. Most 
were well funded and were carried out by research organizations with established 
track records in random assignment and data collection. The Manpower Demonstration
Research Corporation (MDRC) oversaw random assignment in four of the
experiments; Abt Associates oversaw two; and Mathematica Policy Research
(MPR) oversaw two. The remaining experiment was overseen by university-based 
researchers. Because details of the experimental designs and their implementation 
were not reported in all the replication studies, we retrieved background reports 
and methodological appendixes, examined nonresponse analyses, and corresponded
with researchers. We concluded that most of the experiments had relatively
low crossover and attrition rates and that the attrition and nonresponse did
not appear to be related to treatment status in a way that would threaten the con- 
clusions' validity. 

Protocol and coding 

Once the studies were assembled, we followed a procedure laid out in a formal 
protocol (Glazerman, Levy, and Myers 2002) to extract data from the source stud- 
ies and code them for analysis. For example, the coding form had questions about 
the source of the comparison group used for each NX estimator in each study. Two 
coders read the twelve studies and extracted all the information needed for the 
analysis. They coded two studies together to ensure a consistent understanding of 
the coding instrument and process. Then, each coded a subset of the rest, with 
ample consultation built into the process to increase coding accuracy (see 
Glazerman, Levy, and Myers [2002] for details on this and other aspects of the 
research synthesis methods). We also contacted authors of nearly every source 
study to obtain clarification and, sometimes, additional data. Further details of the 
variables that were coded are mentioned below. 

Analysis methods 

The goal of our analysis is to determine how selection bias varies with the type of 
estimator employed, the setting, and the interaction between the setting and the 
type of estimator. To answer this, we model \(B(\hat{\theta}_{jk})\), the bias associated with estimator
\(k\), as a function of the characteristics and context of the study (indexed by \(j\)) and






THE ANNALS OF THE AMERICAN ACADEMY 



its intervention, captured in a vector labeled \(Z\), and the characteristics of the estimator
itself, captured in a vector labeled \(W\).

\[ |B(\hat{\theta}_{jk})| = f(Z_j, W_k, Z_j W_k). \tag{5} \]

We use the absolute value of \(B(\hat{\theta}_{jk})\) on the left-hand side of the equation because a
researcher or research synthesist wants to choose designs to minimize the bias, 
whatever its direction. An interaction between study-level and estimator-level vari- 
ables is included to capture the interplay between method and context. 
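
To make the structure of equation (5) concrete, the sketch below builds one row of regressors from a study-level vector and an estimator-level vector, including every pairwise interaction. It assumes a linear form for f, which the text above does not specify, and the function names and indicator choices are hypothetical.

```python
from itertools import product


def design_row(z, w):
    """Return [intercept, Z terms, W terms, every Z*W product] for one bias estimate."""
    interactions = [z_value * w_value for z_value, w_value in product(z, w)]
    return [1.0, *z, *w, *interactions]


def outcome(bias_estimate):
    """Left-hand side of equation (5): the size of the bias, whatever its direction."""
    return abs(bias_estimate)


# Hypothetical coding for one estimate:
# Z = [small experiment, program found effective]
# W = [national data set used, prior earnings used]
z_j = [1.0, 0.0]
w_k = [1.0, 1.0]

row = design_row(z_j, w_k)
y = outcome(-1_250.0)
print(row, y)
```

The interaction terms are what let the estimated penalty for a given method differ by setting. With only a handful of studies, a full set of interactions like this quickly exhausts the available degrees of freedom, which is one reason to keep the indicators coarse.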

One might expect that some types of NX design perform better than others and 
that some designs are more appropriate under certain study conditions. To test 
this, it is important to describe each type of NX estimator. Of the many classifica- 
tion schemes, the most commonly used are those given by Campbell and Stanley 
(1966) and Cook and Campbell (1979). Alternative formulations by economists 
such as Heckman and Hotz (1989) and Heckman et al. (1998) are also useful for 
categorizing methods in a general way; however, we prefer to avoid forcing the 
methods into mutually exclusive categories, because many of the estimators used 
multiple approaches. Instead, we describe each estimator by a vector of character- 
istics that pertain to (1) the source of the comparison group and (2) the analytic 
techniques used to adjust for differences between the comparison group and the 
treatment population. 

Because there is a limited range of NX designs assessed in the design replication
literature, we must use very gross indicators to categorize NX designs. For the 
source of the comparison group, we coded three binary indicator variables: one for 
whether the comparison group is drawn from a national data set, such as Survey of 
Income and Program Participation (SIPP); one for whether the comparison group 
is based on sample members from the same geographic area as the treated popula- 
tion; and one for whether the comparison group is formed by using the randomized 
control group from a different experiment. For the type of statistical adjustment, 
we used four variables to broadly indicate (1) whether background variables were 
used as covariates in a regression model; (2) whether matching methods, such as 
stratification on an estimated propensity score, were used; (3) whether the estima- 
tor used preintervention measures of the outcome — examples include difference- 
in-differences models, fixed-effect models, or even regression or matching models 
using baseline earnings; and (4) whether the estimator was based on an economet- 
ric sample selection model. The selection modeling indicator (4) would be set to 
one, for example, if the estimator used the inverse Mills' ratio or instrumental vari- 
ables with a set of variables that were included in a model of program participation 
but excluded from the model of earnings determination. 
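
The coding scheme described above, three comparison-group indicators plus four adjustment indicators, can be sketched as a small record type. The class and field names below are hypothetical labels for the seven indicators and are not taken from the coding protocol itself.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class EstimatorCoding:
    # Source of the comparison group
    national_data_set: bool
    same_geographic_area: bool
    control_group_from_other_experiment: bool
    # Type of statistical adjustment
    regression_covariates: bool
    matching: bool
    preintervention_outcome: bool
    selection_model: bool

    def as_vector(self):
        """Return the indicators as a list of 0/1 values in field order."""
        return [int(flag) for flag in astuple(self)]


# Hypothetical estimator: propensity score matching on a local comparison
# group, combined with a difference-in-differences adjustment.
matched_did = EstimatorCoding(
    national_data_set=False,
    same_geographic_area=True,
    control_group_from_other_experiment=False,
    regression_covariates=False,
    matching=True,
    preintervention_outcome=True,
    selection_model=False,
)
print(matched_did.as_vector())
```

Because the indicators are independent flags rather than one categorical label, an estimator that combines approaches, as this one does, is described without being forced into a single category.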

We constructed other indicators to identify conditions under which quasi- 
experiments were potentially more likely to replicate experiments. One set of indi- 
cators measured whether a specification test was conducted, and if so, whether the 
test would have led the researcher to avoid the estimator a priori. Another set of 
indicators measured whether the background variables used in the regression or
in the matching procedure were detailed, as with a survey, or sparse, as is typically
the case with administrative data. Variables included in the Z vector
include the experiment's sample size, grouped into categories for small, medium,
and large; and the program's estimated effectiveness: effective, ineffective, or
indeterminate.

To estimate the average bias reduction associated with these design and context 
variables, we used both bivariate analyses (tabulations) and multivariate analyses 
(regression). Because such a small collection of studies limits the degrees of free- 
dom, we expect to find the data consistent with several competing explanations for 
why the estimated bias is high or low. The bivariate analyses use sample weights to 
account for the unequal sample sizes of the source studies, although we found that
weighting made little difference to the qualitative findings. Similarly, for the 
multivariate analyses, we tried alternative aggregation procedures to deal with lack 
of statistical independence among bias estimates from a single study. To minimize 
artificial replication, the regression results in the next section use the average of the 
absolute value of the bias estimates associated with each unique combination of 
design variables. For example, if one study produced eight quarterly bias estimates 
corresponding to impacts after random assignment, we aggregated them into a sin- 
gle estimate for the two-year period, as long as the policy interpretation for the two- 
year period made sense. 
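
One way to read the aggregation step for the regression input is sketched below: bias estimates are grouped by their combination of design variables, and the absolute values are averaged within each group. The grouping keys and dollar figures are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_abs_bias(estimates):
    """Map each design key to the mean absolute bias of its estimates.

    `estimates` is an iterable of (design_key, bias) pairs.
    """
    groups = defaultdict(list)
    for design_key, bias in estimates:
        groups[design_key].append(abs(bias))
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical input: eight bias estimates for one design, two for another.
matching = [("study A, matching", b) for b in (-300, 200, -100, 400, -250, 150, -50, 350)]
regression = [("study A, regression", b) for b in (-600, 800)]

result = aggregate_abs_bias(matching + regression)
signed_mean = mean(b for _, b in matching)
print(result, signed_mean)
```

In this invented example the eight matching estimates average 50 when signs are kept but 225 in absolute value, which shows why averaging absolute values gives a different picture from averaging the signed estimates.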

A constraint on more detailed analyses than those just described was dictated by 
having just twelve replication studies. While many of these studies assessed multi- 
ple NX estimators, resulting in more than one thousand bias estimates, the overall 
diversity of designs was not as comprehensive a catalogue of quasi-experimental 
methods as those described by Cook and Campbell (1979) and others. Among 
those methods that were assessed, not every method was assessed in every setting. 
As more empirical work comes to light, more sophisticated analysis may be 
possible.

Results

Our synthesis of design replication studies takes an important first step toward 
answering the research questions of this article. However, interpretation of the evi- 
dence remains a challenge. Even among authors of the studies we reviewed, there 
was no consensus on how to judge the size of differences between experimental 
and NX impact estimates. The authors differed in the extent to which they probed 
the statistical and policy significance of their results. Some focused narrowly on
their own case studies; others made broader statements praising or condemning a 
NX method. Four studies concluded that NX methods performed well, four found 
evidence that some NX methods performed well while others did not, and four 
found that NX methods did not perform well or that there was insufficient evi- 
dence that they did perform well. A summary of their conclusions is given in the 
appendix. In this section, univariate analyses describe the range of bias estimates in 
the literature. Bivariate analyses then relate the absolute size of the bias to several 
explanatory factors. The multivariate analysis that follows uses regression to deter- 
mine whether the different explanations of bias overlap and whether one predomi- 
nates. Finally, we examine the distribution of the bias estimates to consider 
whether they cancel out across studies and whether their variation is due to true 
variation in the performance of NX methods or some other explanation. 

Univariate analysis 

From the twelve studies, we extracted 1,150 separate estimates of the bias, 
about 96 estimates per study. While some of the bias estimates were close to zero, 
some were very large, overestimating or underestimating annual earnings impacts 
by as much as $10,000 or more. Table 2 shows the bias estimates by study. 

The definition of a "large" bias depends on the program and the policy decision 
at stake. However, for disadvantaged workers, even a $1,000 difference in annual 
earnings is important. For example, in a benefit-cost study of Job Corps 
(McConnell and Glazerman 2001), a steady-state impact on annual earnings of 
about $1,200 was used to justify the program's expenditure levels, one of the high- 
est per trainee (about $16,500) for any federal training program. A difference of 
$800 in the annual earnings impact estimate would have completely changed the 
study's outcome and might have led to a recommendation to eliminate rather than 
expand the annual $1.4 billion program. For programs, such as the Job Training 
Partnership Act (JTPA) and the various welfare-to-work programs captured in our 
data, where both the program costs and the impacts on earnings are likely much 
smaller, a difference of $1,000 or more can make a dramatic difference in the policy 
recommendation. 

Another benchmark is the average earnings of control group members. In many 
of the studies we reviewed, the inflation-adjusted annual earnings of control group 
members was about $10,000, which includes zero earnings for nonworkers. Thus, a 
$1,000 bias would represent 10 percent of earnings, a substantial amount.

TABLE 2

DESCRIPTIVE STATISTICS OF BIAS ESTIMATES BY STUDY

Bias Estimates (Annual Earnings in 1996 Dollars)

                                                                          Average of               Number of
                                    Range of              Average of  Absolute Value   Number of    Types of
Study                               Estimates              Estimates  of Estimates^a   Estimates   Estimates

Bell et al. (1995)                  -723 to +5,008               661             813          54           3
Bloom et al. (2002)                 -21,251 to +12,215           498           1,114         564           8
Bratberg, Grasdal, and Risa (2002)  -18,702 to -654           -4,826           2,907          13           5
Dehejia and Wahba (1999)            -1,939 to +1,212             173           4,163          40           4
Fraker and Maynard (1987)           -3,673 to +871              -751           1,103          48           3
Gritz and Johnson (2001)            -1,091 to +3,189             497             780          48           2
Heckman et al. (1998)               -7,669 to +8,154            -423           3,273          45          17
Hotz, Imbens, and Mortimer (1999)   -1,682 to +2,192            -128             585          36           2
Hotz, Imbens, and Klerman (2000)    -1,248 to +438              -174             371          64           6
Lalonde (1986)                      -5,853 to +4,143            -636           2,849         112           8
Olsen and Decker (2001)             -1,548 to +1,107            -363           1,397          10           5
Smith and Todd (forthcoming)        -11,743 to +4,829         -1,655           4,019         116           6
Total                               -21,251 to +12,215          -637           2,325       1,150          69

a. The average of the absolute value of the bias is used to compare different research designs. 
Therefore, we calculate it by first averaging bias within design type and then averaging the results 
across the sixty-nine design types. 



As mentioned earlier, one should interpret the statistics in Table 2 with caution. 
The average of the bias estimates can be substantially influenced by outliers 
reflecting small samples or unrealistic estimators. However, the average does indi- 
cate whether the estimates are centered on zero and whether they tend to overesti- 
mate or underestimate impacts relative to the experimental benchmark. Eight of 
the twelve studies in our analysis showed that NX methods tended to understate 
impacts; four showed the opposite. All the studies included bias estimates that 
were both negative and positive, except for the one by Bratberg, Grasdal, and Risa 
(2002), in which all the econometric and matching techniques had negative bias 
estimates. As one would expect, the study with the greatest number of estimates
(Bloom et al. 2002) found the broadest range of estimates, with very large positive
and negative values.

The absolute value of the bias provides a more direct measure of the perfor- 
mance of the NX estimator, where a smaller value always represents better perfor- 
mance. With that measure, the typical NX estimate of impact on annual earnings 
deviates from the corresponding experimental estimate by about $2,000. The average
absolute value in any one study ranged from twice that amount, as in the
attempts by Dehejia and Wahba (1999) and by Smith and Todd (forthcoming) to
replicate the findings of the NSW experiment using national data sets, to less than
$600 per year, as in the two studies by Hotz and colleagues (Hotz, Imbens, and 
Klerman 2000; Hotz, Imbens, and Mortimer 1999). 

Bivariate analysis 

To begin to explain the range of NX bias, we conducted simple bivariate analyses 
examining the relationship between several possible explanatory variables and the 
size of the bias. The candidate variables are those that describe the quasi- 
experimental approach and the study in which it was implemented, including the 
source of the comparison group, the statistical method, and the quality of the data.

For each value of an explanatory variable, we computed the average of the abso- 
lute value of the bias for all NX estimators with that value (see Table 3). For the 
entire sample of studies, the unweighted average of the absolute value of the bias 
associated with using NX methods was about $1,500. However, this was based on 
all 1,150 bias estimates without aggregating to account for nonindependence of the 
estimates or unequal sample size. Therefore, we constructed two sets of weights. 
The first (weight 1) gives more emphasis to estimates based on studies that had 
larger samples as measured by the number of control group members in the randomized
experiment; the other (weight 2) multiplies the sample-size weight by a
factor inversely proportional to the number of estimates for a given sample. For 
example, if a researcher used 10 different methods to estimate the same impact for 



This finding suggests that, while convenient, 
publicly available data sets at the national 
level are not the best for evaluating training 
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TABLE 3 

AVERAGE BIAS BY CHARACTERISTICS OF ESTIMATOR 



Average of Absolute Value of Bias 
Estimate (Annual Earnings in 1996 Dollars) 

Weight 2 







Weight 1 


(Sample Size, 


Explanatory Variable and Category 


Unweighted 


(Sample Size) 


Frequency) 


Entire sample 


1,477 


1,101 


1,110 


Source of comparison groups* 








Same labor market 


932 


821 


885 


Control group from another site 


843 


902 


814 


National data set 


2,817 


2,409 


2,131 


Statistical method: generaF 






Regression 


1,101 


1,010 


958 


Matching 


1,143 


828 


924 


Selection correction or instrumental variables 2,251 


2,071 


1,412 


None, simple mean differences 


z, /yi 


1,323 


1,515 


o Ldiia Liud.1 nioLnuu. Lype oi niaiciniiy 








Propensity score matching: one to one 


1,047 


739 


744 


Propensity score matching: one to many 


i,loi 


852 


929 


Other matching technique 


1,231 


1,037 


1,297 


Did not use matching 


1,750 


1,288 


1,311 


Statistical method: specification test result 








Specification not recommended 


4,027 


3,165 


2,870 


Specification recommended 


1,155 


857 


1,103 


No test conducted 


1,247 


1,047 


988 


Quality of background data: regression 






Poor set of controls 


2,336 


1,438 


1,590 


Extensive set of controls 


1,228 


1,030 


1,036 


Very extensive set of controls 


1,026 


1,008 


1,016 


Did not use regression 


2,431 


1,412 


1,589 


Quality of background data: matching 








Poor set of covariates 


1,752 


1,290 


1,313 


Extensive set of covariates 


1,392 


951 


1,330 


Very extensive set of covariates 


1,113 


802 


920 


Did not use matching 


1,750 


1,288 


1,311 


Quality of background data: overall 








Used prior earnings 


1,224 


1,040 


1,003 


Did not use prior earnings 


2,662 


1,379 


1,591 


Experimental sample size 








Small (< 500 controls) 


2,533 


2,378 


2,728 


Medium (500 to 1,500 controls) 


1,080 


1,001 


960 


Large ( > 1 ,500 controls ) 


819 


819 


800 


Experimental impact finding 








Program is effective 


2,089 


1,288 


1,276 


Program is ineffective 


1,197 


920 


1,105 


Indeterminate 


924 


1,021 


911 


Number of observations 


1,150 


1,150 


1,150 



a. Categories are not mutually exclusive or exhaustive. 
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one site or subgroup, then the corresponding bias estimates received a weight of 1/ 
10 times the sample size. Although the results vary somewhat by type of weight 
used, the qualitative conclusions drawn from them do not, so we focus on the 
results in the last column, which account for sample size and frequency of sample. 
Both weights reduce the average absolute value of the bias to about $1,100. 
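
The two weighting schemes can be illustrated with a minimal sketch. It assumes that weight 1 is proportional to the sample size of the randomized experiment and that weight 2 divides that weight by the number of bias estimates reported for the same sample. The function name, dictionary keys, and numbers are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def mean_absolute_bias(estimates, scheme="unweighted"):
    """Average |bias| over estimates under one of three weighting schemes.

    Each estimate is a dict with hypothetical keys:
      "sample": label of the site or subgroup the estimate refers to
      "n":      sample size of the corresponding randomized experiment
      "bias":   NX impact estimate minus experimental impact estimate
    """
    # Count how many bias estimates were reported for each sample.
    estimates_per_sample = Counter(e["sample"] for e in estimates)

    weighted_sum = 0.0
    weight_total = 0.0
    for e in estimates:
        if scheme == "unweighted":
            w = 1.0
        elif scheme == "weight1":  # proportional to sample size
            w = float(e["n"])
        elif scheme == "weight2":  # sample size divided by number of estimates
            w = e["n"] / estimates_per_sample[e["sample"]]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        weighted_sum += w * abs(e["bias"])
        weight_total += w
    return weighted_sum / weight_total


# Made-up numbers for illustration only (not taken from the studies reviewed).
example = [
    {"sample": "site A", "n": 1000, "bias": 400},
    {"sample": "site A", "n": 1000, "bias": -600},
    {"sample": "site B", "n": 3000, "bias": 200},
]
unweighted = mean_absolute_bias(example, "unweighted")  # 400.0
weight1 = mean_absolute_bias(example, "weight1")        # 320.0
weight2 = mean_absolute_bias(example, "weight2")        # 275.0
```

In this toy case, site A contributes two estimates, so under weight 2 each of them counts half as much as it does under weight 1.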

Table 3 shows that some factors are indeed associated with higher and lower 
bias. As one would expect, the source of the comparison group has a role. The aver- 
age bias was lower (less than $900) when the comparison group came from the 
same labor market as the treated population or was composed of randomized con- 
trol group members from a separate experiment and higher (more than $2,000) 
when the comparison group was drawn from a national data set. This finding sug- 
gests that, while convenient, publicly available data sets at the national level are not 
the best for evaluating training or welfare programs. 

Aspects of the statistical method also were associated with the size of the bias. 
There was little difference between regression and matching methods overall, but 
some matching methods performed better than others. In particular, one-to-one 
propensity score matching had lower bias than other propensity score methods or 
non-propensity-score matching. Five of the studies (Lalonde 1986; Heckman et al. 
1998; Gritz and Johnson 2001; Bratberg, Grasdal, and Risa 2002; and Bloom et al. 
2002) included some form of econometric selection correction procedure such as 
the Heckman two-step estimator or instrumental variables estimator, but these 
methods performed poorly on average, about as poorly as using no method at all. 

Rather than examine all quasi-experimental estimators, it may be more productive 
to focus on the performance of those one would expect (in the absence of a 
randomized experiment) to be the best ones. To make such a priori predictions, 
researchers use specification tests, as illustrated by Heckman and Hotz (1989) in 
their reanalysis of Lalonde's (1986) replications of the NSW experiment. The 
typical specification test applies the NX estimator to outcome data from before the 
intervention. If the estimated impacts, which should be zero since nobody has 
been exposed to the intervention, are larger than would be expected by chance, 
then the estimator is rejected, and its use is not recommended. Many of the design 
replication studies that we reviewed did not conduct specification tests. Among 
those that did, the average absolute bias of rejected estimators was nearly $2,900, 
almost three times the bias of recommended ones. This suggests, consistent with 
the findings of Heckman and Hotz, that specification testing, where feasible, can 
help eliminate poor-performing estimators. The estimated bias of the recom- 
mended estimators, however, was still large in absolute value: more than $1,000. 
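
The logic of such a test can be sketched in a few lines. This is a simplified illustration, not the procedure used by Heckman and Hotz: it assumes the NX estimator is a plain difference in means, applies it to preintervention earnings, and rejects the estimator when the "placebo" impact is larger than a two-sided 5 percent z-test would allow. The function name and the data are hypothetical.

```python
import math
from statistics import mean, variance


def pre_period_specification_test(treat_pre, comp_pre, critical_z=1.96):
    """Apply a difference-in-means estimator to preintervention outcomes.

    The true impact before the intervention is zero, so a placebo impact
    that is large relative to its standard error is evidence against the
    estimator (here, against the comparison group being comparable).
    """
    placebo_impact = mean(treat_pre) - mean(comp_pre)
    std_error = math.sqrt(
        variance(treat_pre) / len(treat_pre) + variance(comp_pre) / len(comp_pre)
    )
    z = placebo_impact / std_error
    return {"placebo_impact": placebo_impact, "z": z, "rejected": abs(z) > critical_z}


# Hypothetical preintervention annual earnings.
treatment = [1000, 1200, 1100, 900, 1300]
similar_comparison = [1050, 1150, 1100, 950, 1250]
richer_comparison = [3000, 3200, 3100, 2900, 3300]

ok = pre_period_specification_test(treatment, similar_comparison)
bad = pre_period_specification_test(treatment, richer_comparison)
```

Here the first comparison group passes the test and the second is rejected, because its members earned about $2,000 more than the treatment group before anyone was treated.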

Some authors (Heckman et al. 1998; Smith and Todd forthcoming) have suggested 
that data quality may be as important as the research design. By categorizing 
estimators by the richness of the background variables (used as covariates in a 
regression or as matching variables) to explain the size of the bias, we found some 
support for this claim.^8 The results in Table 3 suggest that the estimators based on a 
more extensive set of variables in a regression or matching method had lower bias. 



The most important variable to include in the variable set was prior earnings. 
Studies without it had a bias of about $1,600; those with it, $1,000.^9 

Finally, we found that the performance of NX methods was related to the sam- 
ple size and direction of impacts for the experiment. Specifically, the NX methods 
more closely replicated the experiments when the randomized control groups 
were large and when the experiments did not show the program was effective. One 
possible explanation for the large average bias (more than $2,700) in small studies 
is that the experimental impacts were not precisely estimated, so the estimate of 
bias is also not precisely estimated. Another possible explanation is the size of the 
nonrandomized comparison group, which tends to be small when the control 
group is small, so the larger estimated bias may reflect random noise in the NX 
estimate. Because the sample sizes of control and comparison groups are correlated, it 
is difficult to distinguish between these two stories. The relationship between the 
direction of the experimental impact and the size of bias suggests that a false posi- 
tive finding — concluding from the NX evidence that a program works when it does 
not — may be more common than a false negative. 

A limitation of this bivariate analysis is that the design elements listed in Table 3 
are not independent. For example, a study that uses a national data set to select a 
comparison group is likely also to use a relatively poor set of controls; this means 
that the large average bias for studies that use a national data set could be reflecting 
a poor set of controls. It is difficult to distinguish these explanations. We therefore 
proceed with multivariate regression analysis to try to disentangle the factors asso- 
ciated with lower bias. 



Multivariate analysis 

To examine the effect of research design on bias, we estimated several regres- 
sions with the absolute value of the bias in annual earnings as the dependent vari- 
able and the design attributes as explanatory variables (see Table 4). As suggested 
earlier, other types of explanatory variables could also explain bias. However, we 
have limited degrees of freedom, so we use a parsimonious model that includes 
indicator variables for each design replication study to proxy for all the measured 
and unmeasured characteristics that vary only at the study level. Because the 
regression models focus on NX design configurations, we did not weight the indi- 
vidual bias estimates. We averaged them within design types so that each design 
type would be represented no more than once for each study. This aggregation 
resulted in an analysis data set of sixty-nine bias estimates. The regression results 
are meant to be illustrative, because some of the design attributes are highly corre- 
lated with each other, the data set is very small, and the results depend on the 
regression specification used. 
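
The aggregation step can be sketched as follows, assuming each bias estimate carries a study label and a design-type label. The labels and values are hypothetical, and the sketch covers only the averaging, not the regressions reported in Table 4.

```python
from collections import defaultdict


def average_within_design(estimates):
    """Collapse bias estimates to one value per (study, design type).

    `estimates` is an iterable of (study, design_type, abs_bias) tuples.
    Returns a dict mapping (study, design_type) to the mean absolute bias,
    so each design type appears at most once per study.
    """
    groups = defaultdict(list)
    for study, design_type, abs_bias in estimates:
        groups[(study, design_type)].append(abs_bias)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical inputs: two regression estimates from Study A are averaged.
rows = [
    ("Study A", "regression", 900),
    ("Study A", "regression", 1100),
    ("Study A", "matching", 800),
    ("Study B", "regression", 2000),
]
analysis_data = average_within_design(rows)
```

Four raw estimates collapse to three analysis records here, in the same way that the 1,150 estimates collapse to sixty-nine in our data.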

Keeping these limitations in mind, we found the regressions largely confirm 
what one would expect. Outcomes for the various nonrandomized comparison 
groups available to evaluators are not good approximations to the counterfactual 
outcomes if left unadjusted. The intercept in the regression models shown in the 
odd-numbered columns of Table 4 represents the bias associated with raw mean 
differences, estimated to be in the range of $4,400 to $5,800 in annual earnings (see 
the first row). This coefficient is the expected bias in our sample, if one did not 
make any adjustments to a typical comparison group. In the regression models 
shown in the even-numbered columns, we include a separate intercept for each 
study, the study-level fixed effect described above. 

The entries in the next two rows of Table 4 suggest that using background data as 
either covariates or matching variables is about equally effective at reducing bias. 
These techniques reduce the bias by about $3,100 to $3,600, once we account for 
the studies' fixed effects (column 6). The sensitivity of this result to the inclusion of 
fixed effects suggests that the relative performance of regression-based designs 
versus matching designs is confounded with contextual factors. 

Combining methods is better than applying them individually. Models 5 and 6 
include an interaction term with a positive coefficient, which suggests that the bias 
reduction from these two methods is not fully additive, although there is likely 
some increased benefit from their combination. In model 5, for example, the bias 
from raw differences in means, represented by the intercept, is $5,775. This value 
is reduced to $2,550, if only regression is used, and to $3,312, if only matching is 
used (holding comparison group variables fixed at the value of the omitted catego- 
ries). If matching and regression are both used, they reinforce each other to reduce 
the bias to $1,038. 
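
The arithmetic behind the model 5 example can be made explicit. The sketch below uses only the four dollar figures quoted in the paragraph above; the implied effects are backed out from those rounded figures and are not read from Table 4, so they should be treated as approximate.

```python
# Predicted absolute bias under model 5, as quoted in the text (annual earnings).
intercept = 5775             # raw difference in means
bias_regression_only = 2550
bias_matching_only = 3312
bias_both = 1038

# Main effects implied by the single-method predictions.
regression_effect = bias_regression_only - intercept  # -3225
matching_effect = bias_matching_only - intercept      # -2463

# If the two reductions were fully additive, the combined bias would be:
additive_prediction = intercept + regression_effect + matching_effect  # 87

# The gap between the reported combined bias and the additive prediction
# is the implied interaction term (positive, so not fully additive).
implied_interaction = bias_both - additive_prediction  # 951
```

The implied interaction is positive, so the combined reduction is smaller than the sum of the two separate reductions, yet the combined bias ($1,038) is still below what either method achieves alone.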

Baseline measures of the outcome are important. The negative coefficients on the 
difference-in-difference indicator, which equals one if the estimator uses 
preintervention earnings in any way, show this, as reported in the literature. 
For the simpler 
models 2 and 4, difference-in-difference estimators reduce the bias by about 
$1,600 in annual earnings, a reduction slightly larger than that achieved with other 
estimators. The interaction terms of difference-in-differences with the regression 
and matching (see models 5 and 6) indicate that these methods are also partially 
offsetting. 
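
For readers less familiar with the approach, a textbook difference-in-differences estimate can be sketched as below. This is a generic illustration under the usual assumption that both groups would have followed the same earnings trend without the program. The indicator coded in our regressions is broader, since it flags any use of preintervention earnings. The function name and the data are hypothetical.

```python
from statistics import mean


def difference_in_differences(treat_pre, treat_post, comp_pre, comp_post):
    """Change in mean earnings for the treatment group minus the change
    in mean earnings for the comparison group."""
    treatment_change = mean(treat_post) - mean(treat_pre)
    comparison_change = mean(comp_post) - mean(comp_pre)
    return treatment_change - comparison_change


# Hypothetical annual earnings before and after the program.
treat_pre, treat_post = [3000, 5000], [6000, 8000]  # means 4000 and 7000
comp_pre, comp_post = [5000, 7000], [7000, 9000]    # means 6000 and 8000

raw_difference = mean(treat_post) - mean(comp_post)  # -1000
did_estimate = difference_in_differences(treat_pre, treat_post, comp_pre, comp_post)  # 1000
```

In this made-up case the raw postprogram comparison suggests a loss of $1,000, while netting out the preexisting $2,000 earnings gap yields a gain of $1,000.
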
The one estimator that did not reduce bias at all, but in fact increased it, was the 
selection correction estimator, but this should be interpreted cautiously. Few esti- 
mates in our data were based on econometric methods such as the two-step estima- 
tor. Of these, one study (Bratberg, Grasdal, and Risa 2002) rejected the specifica- 
tion based on a hypothesis test but still reported the bias estimate, which was 
particularly large. Of the others, none produced a compelling justification for the 
exclusion restrictions that typically justify such an approach. An exclusion restric- 
tion is an assumption that some variable is known to influence participation in the 
program (selection into treatment) but not the outcome of interest (earnings). 

The use of a comparison group that is matched to the same labor market or geo- 
graphic area reduced bias by about $600. Funders of evaluation research probably 
prefer to use large national data sets to evaluate programs, because secondary analy- 
ses are far less costly than new data collection. Our findings suggest that such a 
strategy comes with a penalty, an increase of average bias by about $1,700 (Table 4, 
column 6). 



We coded another comparison group strategy that determined whether the 
source was a control group from another study or another site. Several studies — for 
example, those by Hotz, Imbens, and Mortimer (1999); Hotz, Imbens, and 
Klerman (2000); and Bloom et al. (2002) — compared the control group from one 
site to the control group from another site and labeled one as the nonrandomized 
comparison group. We included the "control group from another site" indicator 
variable in the regression primarily to distinguish between those studies from oth- 
ers that used comparison groups more readily available to researchers, such as eli- 
gible nonapplicants (for example, Heckman et al. 1998) or individuals who applied 
to the program but were screened out (for example, Bell et al. 1995). One might 
argue that control groups are not available to most evaluators, so the more relevant 
bias estimates are the larger ones found when the "other control group" indicator 
equals zero. 

The regression results described above are robust to the definition of the 
dependent variable. We conducted the same analysis using the signed value of the 
bias and found very similar results. Those results, available from the authors, show 
that overall, the unadjusted bias is large and negative. The regressors representing 
design features increase the bias (toward zero) in much the same way that they 
decreased the absolute value of the bias as shown in Table 4. Other dependent 
variables can be used to further analyze the bias estimates. For example, we created 
two indicator variables, one for whether the NX impact estimate led to the same 
statistical inference and another for whether it led to the same policy conclusion. 
Constructing these variables required some additional information, such as the 
threshold value that would change the policy conclusion, but they allow us to 
include in a meta-analysis the results from a wider range of design replication 
studies, including those that focus on education interventions and those whose 
outcomes are not measured in earnings. The analyses based on these binary outcome 
variables are beyond the scope of the current article and will be presented in future 
work. 

FIGURE 1 

DISTRIBUTION OF BIAS ESTIMATES FOR ALL TWELVE STUDIES 



Aggregation of NX evidence 

Now we turn to the third research question, whether averaging multiple NX 
impact estimates can approximate the results from a well-designed and well- 
executed experiment. The above discussion suggests that while some factors are 
associated with lower bias estimates, a single NX estimator cannot reliably repli- 
cate an experimental one. The inability to achieve reliable replication may be due 
to bias or to sampling error in either the experiment or the quasi-experiment. 
However, a possibility exists that a large enough group of NX estimates pertaining to 
different study sites, time periods, or interventions might replicate experiments on 
average. We therefore examined the extent to which positive and negative bias 
estimates cancel out. If they do, those who conduct literature reviews would have 
reason to accumulate a large body of NX evidence, when experiments are scarce, to 
draw valid conclusions. A useful way to make this assessment is by examining the 
full distribution of bias estimates for various groupings, such as by intervention, 
by method, or for a collection of interventions, and looking for a pattern of 
estimates that tends toward zero. 

FIGURE 2 

DISTRIBUTION OF BIAS ESTIMATES FROM NATIONAL EVALUATION 
OF WELFARE TO WORK STRATEGIES (NEWWS) (BLOOM ET AL. 2002) 

The distribution of the 1,150 bias estimates from the twelve studies reviewed 
in this article provides a case where the bias estimates do appear to cancel out 
(Figure 1). The distribution is nearly centered on zero with a slight skew. The average 
bias was about -$600. Applying the weights described above brings the overall bias 
closer to zero, -$217; and removing the outliers and applying weights makes it even 
smaller, about -$97.^11 This is a crude indicator, but it suggests, consistent with the 
work of Lipsey and Wilson (1993), that if enough NX studies are combined, the 
average effect will be close to what the experimental evidence would predict. 
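
The distinction between the average signed bias, which can cancel, and the average absolute bias, which cannot, is easy to see in a small sketch. The numbers below are invented for illustration and are not drawn from the studies reviewed.

```python
def summarize_bias(bias_estimates):
    """Return the mean signed bias and the mean absolute bias."""
    n = len(bias_estimates)
    return {
        "mean_signed": sum(bias_estimates) / n,
        "mean_absolute": sum(abs(b) for b in bias_estimates) / n,
    }


# Hypothetical bias estimates (NX minus experimental impact, in dollars).
summary = summarize_bias([-2500, 1800, -900, 1200, -100])
# mean_signed is -100.0, while mean_absolute is 1300.0
```

A signed average near zero is therefore compatible with individual estimates that are each off by more than a thousand dollars.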

FIGURE 3 

DISTRIBUTION OF BIAS ESTIMATES FROM SUPPORTED 
WORK (SMITH AND TODD FORTHCOMING) 

Rarely, however, is the NX research used to answer such broad questions as 
whether all programs are effective. Instead, we would like to identify dimensions 
along which the bias begins to cancel out for more focused questions such as "What 
is the average impact of program X?" For the studies reviewed in this article, the 
average bias was sometimes close to zero (see Table 2) but often was still 
substantial. Each of the studies in the review addressed a single intervention, although 
some assessed more NX estimators, analyzed more subgroups, had larger samples, 
or included more sites. The distribution of bias estimates within studies — particu- 
larly studies that use multiple subgroups, study sites, or time periods, in addition to 
multiple estimators — makes this clearer. Figures 2 and 3 display the distribution of 
bias estimates for two of the studies that examined the largest number of estima- 
tors. For the first study (Bloom et al. 2002), the bias estimates are centered roughly 
on zero, with an average of -$151; but for the second study (Smith and Todd forth- 
coming), they clearly are not, with an average of -$2,563. It is possible to remove 
outliers from the estimates reported by Smith and Todd (forthcoming) to achieve 
an average bias that is closer to zero, but identifying outliers without the benefit of a 
randomized experiment as a benchmark may be difficult. The within-study evi- 
dence from the other studies (Table 2) suggests that the average bias across all 
methods, subgroups, and time periods is sometimes positive, sometimes negative, 
and often still in the hundreds of dollars. This suggests that a mechanistic applica- 
tion of a large number of NX estimators might improve the inference one could 
draw from such evidence, but not in a predictable way. Whether the average bias, 
properly weighted within and between studies, is really close enough to zero for 
policy makers, and whether the bias cancels out within a narrower domain of 
research, are questions that we plan to address as more design replication studies 
are completed. 
What Have We Learned about NX Methods? 

Our preliminary review of the evidence suggests that the twelve design replica- 
tion case studies we identified, even taken together, will not resolve any of the long- 
standing debates about NX methods. From the case studies, we uncovered some 
factors that might reduce bias, but we have not identified a reliable strategy for 
eliminating it either in a single study or in a collection of studies. The findings can 
be summarized in terms of the three empirical questions posed in section I. 

Question: Can NX methods approximate the results from a well-designed and well- 
executed experiment? 

Answer: Occasionally, but many NX estimators produced results dramatically dif- 
ferent from the experimental benchmark. 

• Quantitative analysis of the bias estimates underscored the potential for very large bias. 
Some NX impact estimates fell within a few hundred dollars of the experimental estimate, 
but others were off by several thousand dollars. 

• The size and direction of the "average" bias depends on how the average is computed and 
what weighting assumptions are applied. 

• The average of the absolute bias over all studies was more than $1,000, which is about 10 
percent of annual earnings for a typical population of disadvantaged workers. 



Question: Which NX methods are more likely to replicate impact estimates from a 
well-designed and well-executed experiment, and under what conditions are 
they likely to perform better? 

Answer: We identified some factors associated with lower estimated bias. However, 
even with these factors present, the estimated bias was often large. 

• The source of the comparison group made a difference in the average bias estimate. For 
example, bias was lower when the comparison group was drawn from within the evaluation 
itself rather than from a national data set; was locally matched to the treatment 
population; or was drawn as a control group in an evaluation of a similar program or the 
same program at a different study site. 

• Statistical adjustments, in general, reduced bias, but the bias reduction associated with the 
most common methods — regression, propensity score matching, or other forms of match- 
ing — did not differ substantially. Estimators that combined methods had the lowest bias. 
Classical econometric estimators that used an instrumental variable or a separate predictor 
of program participation performed poorly. 

• Bias was lower when researchers used measures of preprogram earnings and other 
detailed background measures to control for individual differences. 

• Specification tests were useful in eliminating the poorest performing NX estimators. 

• Experiments with larger samples were more likely to be closely replicated than those with 
smaller samples. 

• "No impact" or indeterminate impact findings from an experiment were more nearly repli- 
cated than were positive experimental impact findings. 

Question: Can averaging multiple NX impact estimates approximate the results 
from a well-designed and well-executed experiment? 

Answer: Maybe, but we have not identified an aggregation strategy that consistently 
removed bias while answering a focused question about earnings impacts. 

• Estimated biases were both positive and negative, and their distribution across all the 
studies reviewed was centered roughly on zero. This was true both for the full set of estimators 
and for groups of estimates across all studies that used a single method, such as regression 
or matching. 

• For a given intervention, the distribution of bias estimates was sometimes centered near 
zero, and sometimes was not. 

We caution that this summary of findings gives only part of the picture, and it 
does so for a specific area of program evaluation research: the impacts of job train- 
ing and welfare programs on participant earnings. A somewhat more complete 
story can be developed in the short term as additional design replication studies, 
including some that are now in progress, come to light. 

In the meantime, those who plan and design new studies to evaluate the impacts 
of training or welfare programs on participants' earnings can use the empirical 
evidence to improve NX evaluation designs but not to justify their use. Similarly, those 
who wish to summarize a group of NX studies or average over a set of different NX 
estimates to reach a conclusion about the impact of a single program can draw on 
the design replication literature to identify stronger or weaker estimates but not to 
justify the validity of such a summary. 

Appendix 
What Did the Studies Conclude? 



An alternative way to review the literature is to summarize what the authors con- 
cluded in their own words. The twelve design replication studies divided into three 
equal groups about the value of nonexperimental (NX) methods and the degree of sim- 
ilarity between NX findings and those of randomized experiments: four studies con- 
cluded that NX methods performed well, four found evidence that some NX methods 
performed well while others did not, and four found that NX methods did not perform 
well or that there was insufficient evidence that they did perform well (see Table A1). 

The four studies that found positive results (evidence of small bias) qualified their 
conclusions by indicating that a researcher needs detailed background data (particu- 
larly prior earnings), overlap in background characteristics, or intake workers' subjec- 
tive ratings of the applicants they screened. 

It is important to probe the authors' conclusions further than the present discussion 
allows. The various study authors used different standards to assess the size of the bias 
and, in some cases, reached different conclusions with the same data. Furthermore, 
the studies are not of equal value. Some more realistically replicated what would have 
been done in the absence of random assignment than others. Within studies, some of 
the estimators or comparison groups were more or less likely to have been used than 
others, absent an experimental benchmark. Some estimates were based on smaller 
samples than others. A recent summary by Bloom et al. (2002, chap. 2) describes many 
of these studies individually. Section III of this article presents a quantitative analysis of 
all the studies combined. 
Notes 

1. This article uses the term "nonexperimental" as a synonym for "quasi-experimental," although "quasi- 
experimental" is used in places to connote a more purposeful attempt by the researcher to mimic randomized 
trials. In general, any approach that does not use random assignment is labeled nonexperimental (NX). 

2. Findings reported here are drawn from a research synthesis prepared under the guidelines of the 
Campbell Collaboration. The published protocol is available at http://www.campbellcollaboration.org/ 
doc-pdf/qedprot.pdf 

3. A broader review we have undertaken for the Campbell Collaboration includes design replication 
studies that estimate bias for other outcomes, such as student achievement, school dropout, and receipt of 
public assistance benefits. Forthcoming results from that study examine binary indicators for whether the 
experimental and NX estimators support the same statistical inference and whether they support the same 
policy conclusion. 

4. An important area excluded by this criterion was health-related interventions (for example, MacKay 
et al. 1995, 1998). Models of program participation, the key factor in sample selection bias, might be similar 
among education-, training-, and employment-related interventions but are likely to differ markedly for a 
medical or community health intervention. Furthermore, the outcomes would typically be very different. 
We initially applied a broader criterion that included school-related outcomes such as school dropout and 
test scores but ultimately focused on interventions with earnings as the main outcome to limit the number of 
confounding factors. A forthcoming Campbell review will draw on the wider literature. 

5. It is less important for our purposes that the experimental estimator be externally valid or that it 
represent one policy parameter in particular (such as the effect of the treatment on the treated or the local average 
treatment effect), as long as the NX estimator purports to measure the same thing. 

6. Randomization bias results when the treatment group's experience is influenced by the presence of a 
randomized evaluation. Spillover effects result when the control group's experience is influenced by the 
presence of a treatment group. Substitution bias results when control group members are given an alterna- 
tive treatment that they would not have received absent the experiment. John Henry and Hawthorne effects 
result from members of the control and treatment group, respectively, behaving differently because they are 
aware of their inclusion in an experiment. 

7. It is important to recall that a wide range of bias estimates does not necessarily imply a wide range of 
biases, because of sampling error in both the experimental and NX impact estimates. 

8. The coding of the variables representing quality of background data (for regression or matching) nec- 
essarily involves some subjectivity. To be systematic, we applied the following criteria: if the specification 
included several quarters of baseline earnings and a large number of relevant background variables, we 
coded the quality of the data as "very extensive." If the specification contained some baseline measure of 
earnings and a set of individual background variables that captures the key elements that are likely to affect 
outcomes, then it was coded as "extensive." Otherwise, it was coded as "poor." 

9. Some researchers such as Bloom et al. (2002) and Smith and Todd (forthcoming) tried to determine 
the number of quarters of prior earnings needed to reduce bias to acceptable levels, but there were not 
enough other examples to draw any general lessons. 

10. The study population for Bratberg, Grasdal, and Risa (2002) differs from the populations targeted in 
the other studies under review not only because the population was made up of Norwegians but also because 
the sample members were not disadvantaged workers. The larger bias estimates would apply to a larger 
earnings base and therefore not be as substantively important as a similarly sized bias found in a study of 
U.S. welfare participants. Some of this effect is measured by the study fixed effect (see the even-numbered columns 
in Table 4). 

11. Removing the outliers in this case is probably reasonable because the outlying bias estimates 
correspond to NX impact estimates that were implausible on their face (given the collection of other impact 
estimates). One cannot count on being able to identify this type of outlier in general, when an experimental 
benchmark is not available. 

12. We also examined the distribution across studies for a given method — matching and regression — and 
found a similar result. This suggests that aggregation need not be done across methods if a large collection of 
studies and interventions is used. 